Skip to content

[cudax] Implement cudax::coop::reduce for warp groups within a block#9258

Open
davebayer wants to merge 1 commit into
NVIDIA:mainfrom
davebayer:cudax_coop_reduce_warp_groups
Open

[cudax] Implement cudax::coop::reduce for warp groups within a block#9258
davebayer wants to merge 1 commit into
NVIDIA:mainfrom
davebayer:cudax_coop_reduce_warp_groups

Conversation

@davebayer
Copy link
Copy Markdown
Contributor

Fixes #9160.

@davebayer davebayer requested a review from a team as a code owner June 4, 2026 12:32
@davebayer davebayer requested a review from andralex June 4, 2026 12:32
@github-project-automation github-project-automation Bot moved this to Todo in CCCL Jun 4, 2026
@cccl-authenticator-app cccl-authenticator-app Bot moved this from Todo to In Review in CCCL Jun 4, 2026
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Jun 4, 2026

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

Note: CodeRabbit is enabled on this repository as a convenience for maintainers
and contributors. Use your best judgment when considering its review comments and
suggestions — a suggested change may be inadequate, unnecessary, or safe to ignore.
Contributors are not expected to address every comment. Human reviews are what
ultimately matter for merging.

Summary

This PR implements cudax::coop::reduce for arbitrary warp-groupings within a CUDA block (fixes #9160). It adds a warp-group reduction path that performs a two-stage reduction across warps inside a block, adjusts the public wrapper to take the reduction functor by value, and adds Catch2 tests validating multi-warp reductions. Several supporting group/mapping/synchronizer fixes were included to ensure correct rank/mapping and barrier initialization when units are not at thread level.

Core Implementation Changes

  • cudax/include/cuda/experimental/__coop/reduce.cuh
    • Added an overload of __reduce_impl constrained to groups with unit_type == warp_level and level_type == block_level.
    • Two-stage reduction:
      • Per-warp reductions using cub::WarpReduce with per-warp TempStorage; each warp writes one partial into shared memory.
      • Synchronize, then reduce per-warp partials using a root-warp cub::WarpReduce; only the group root returns a cuda::std::optional containing the final reduced value, other threads return nullopt.
    • Public wrapper signature changed from reduce(_Group, _Tp (&)[_Np], _RedFn&&)reduce(_Group, _Tp (&)[_Np], _RedFn) (functor passed by value and forwarded into the impl).

Supporting Changes

  • cudax/include/cuda/experimental/__group/queries.cuh

    • In the __rank_query_group non-matching branch, the linear rank computation now multiplies the group-unit rank by the unit count derived from __group.hierarchy() and adds the unit-local rank (uses __count_query/__rank_query from the hierarchy).
  • cudax/include/cuda/experimental/__group/group.cuh

    • Lane-mask validation in __do_mapping made conditional on _Unit: when _Unit == thread_level assert lane inclusion and popcount constraints; otherwise require lane mask == all lanes (previously always validated per-lane).
  • cudax/include/cuda/experimental/__group/synchronizer/barrier_synchronizer.cuh

    • make_instance now precomputes per-Unit thread metadata (__nthread_in_unit, __thread_rank_in_unit) and derives them from the parent hierarchy when _Unit != thread_level.
    • Barrier initialization guard tightened to require mapping valid, mapping rank == 0, and __thread_rank_in_unit == 0.
    • Barrier init count scaled to __mapping_result.count() * __nthread_in_unit to account for multiple threads per unit.

Tests

  • cudax/test/coop/reduce/warps_within_block.cu (new)

    • Adds ReduceKernel and parametrized tests that create groups of nwarps_in_group warps and verify cudax::coop::reduce across multiple warps.
    • Tests:
      • Integral types with integer reduction operators (plus/mul/bitwise/min/max) — exact equality.
      • Floating-point types (float/double) with FP ops (plus/mul/min/max) — relative tolerance.
      • Per-thread input sizes 1–4.
    • Kernel runtime config supports block size = (nwarps_in_group + 2) * warp_size; only group-member threads participate in the reduction and only group root writes output.
  • cudax/test/CMakeLists.txt

    • Registers new test target coop.reduce.warps_within_block.

API / ABI Surface Changes

  • Added internal overload: __reduce_impl(_Group, _Tp (&)[_Np], _RedFn) for warp-group reductions.
  • Changed public reduce signature to take the reduction functor by value (_RedFn), instead of an rvalue reference.

Review Notes / Areas to Verify

  • Shared-memory layout and sizing for per-warp partials and correctness of root-warp selection (ensure alignment and no bank-conflict issues).
  • Correctness and performance across varying warp-group sizes and SM architectures (two-stage approach and root-warp reduction).
  • Impact of changing public reduce functor parameter to pass-by-value on forwarding/move semantics.
  • Barrier initialization and mapping correctness when units contain multiple threads (ensure counts/ranks align with mapping and that lane-mask validation changes are appropriate).

important:

Walkthrough

Adds a warp-scoped two-stage cooperative reduce overload, conditions group mapping/queries on unit hierarchy and adjusts barrier init for per-unit threads, and adds parametrized CUDA tests validating multi-warp reductions across types, ops, and item counts.

Changes

Cooperative reduce for warps within block

Layer / File(s) Summary
Group mapping lane-mask validation
cudax/include/cuda/experimental/__group/group.cuh
Make post-mapping lane-mask assertions conditional on _Unit: keep inclusion/popcount checks for thread_level, require lane_mask::all() for others.
Group hierarchy ranking correction
cudax/include/cuda/experimental/__group/queries.cuh
Compute __unit_count via __count_query<_Unit,_GroupUnit>(__group.hierarchy()) and return linear rank as __group_unit_rank * __unit_count + __unit_rank.
Barrier synchronizer initialization fix
cudax/include/cuda/experimental/__group/synchronizer/barrier_synchronizer.cuh
Compute per-unit thread metadata for non-thread_level, require __thread_rank_in_unit == 0 for barrier init, and scale init count by __nthread_in_unit.
Core warp-level reduce implementation
cudax/include/cuda/experimental/__coop/reduce.cuh
Add constrained __reduce_impl performing per-warp cub::WarpReduce into shared memory, then a root-warp final cub::WarpReduce; only root warp/lane returns optional result. Change public reduce to take reduction functor by value and forward it.
Test validation for warp-level reduce
cudax/test/CMakeLists.txt, cudax/test/coop/reduce/warps_within_block.cu
Add coop.reduce.warps_within_block test target and parametrized tests: ReduceKernel, helpers, and tests for integral and floating-point types/operators, verifying NumItems=1..4 outputs (exact for integrals, WithinRel for floats).

Assessment against linked issues

Objective Addressed Explanation
Implement cudax::coop::reduce for arbitrary warps within a block groups [#9160]

Possibly related PRs

  • NVIDIA/cccl#9167: Adds other __reduce_impl overloads and related tests; closely aligned with multi-overload reduce work.
  • NVIDIA/cccl#9154: Earlier changes to __reduce_impl overloads and reduce wrapper plumbing; directly related.
  • NVIDIA/cccl#9264: Similar conditionalization of lane-mask validation in group.cuh; related to group mapping changes.

Suggested labels

cudax

Suggested reviewers

  • andralex
  • miscco
  • pciolkosz

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
cudax/test/coop/reduce/warps_within_block.cu (1)

152-153: ⚡ Quick win

suggestion: The test names and tags still say this_warp on Lines 152 and 173, but this file/target is warps_within_block. Rename the C2H_TEST names/tags to warps_within_block to keep filtering and failure triage unambiguous.

Also applies to: 173-174


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: b9ab43e4-f31a-4ae2-94bb-fdad331fffa3

📥 Commits

Reviewing files that changed from the base of the PR and between 2f7cb8b and 3aebfa2.

📒 Files selected for processing (5)
  • cudax/include/cuda/experimental/__coop/reduce.cuh
  • cudax/include/cuda/experimental/__group/queries.cuh
  • cudax/include/cuda/experimental/__group/synchronizer/barrier_synchronizer.cuh
  • cudax/test/CMakeLists.txt
  • cudax/test/coop/reduce/warps_within_block.cu

Comment thread cudax/test/coop/reduce/warps_within_block.cu
@github-actions

This comment has been minimized.

@github-actions

This comment has been minimized.

@davebayer davebayer force-pushed the cudax_coop_reduce_warp_groups branch from 3aebfa2 to 05ebea1 Compare June 5, 2026 11:47
@github-actions

This comment has been minimized.

@davebayer davebayer force-pushed the cudax_coop_reduce_warp_groups branch from 090635a to a7140fb Compare June 8, 2026 20:24
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 1a491f9e-17cf-4aba-a115-6d696e80c0f0

📥 Commits

Reviewing files that changed from the base of the PR and between 090635a and a7140fb.

📒 Files selected for processing (6)
  • cudax/include/cuda/experimental/__coop/reduce.cuh
  • cudax/include/cuda/experimental/__group/group.cuh
  • cudax/include/cuda/experimental/__group/queries.cuh
  • cudax/include/cuda/experimental/__group/synchronizer/barrier_synchronizer.cuh
  • cudax/test/CMakeLists.txt
  • cudax/test/coop/reduce/warps_within_block.cu
🚧 Files skipped from review as they are similar to previous changes (5)
  • cudax/include/cuda/experimental/__group/synchronizer/barrier_synchronizer.cuh
  • cudax/test/CMakeLists.txt
  • cudax/include/cuda/experimental/__group/queries.cuh
  • cudax/test/coop/reduce/warps_within_block.cu
  • cudax/include/cuda/experimental/__coop/reduce.cuh

Comment thread cudax/include/cuda/experimental/__group/group.cuh
@davebayer davebayer requested review from pciolkosz and removed request for andralex June 8, 2026 20:47
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Jun 8, 2026

🥳 CI Workflow Results

🟩 Finished in 35m 02s: Pass: 100%/55 | Total: 8h 09m | Max: 35m 00s | Hits: 71%/46192

See results here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: In Review

Development

Successfully merging this pull request may close these issues.

[FEA]: Implement cudax::coop::reduce for arbitrary warps within a block groups

1 participant